Strawberry Detection Through Deep Learning¶
HS 495 Project 3
Jerry H. Yu
Introduction¶
Computer Vision (CV) tools are receiving increasing attention in agricultural fields after demonstrating their ability to increase throughput and cost effectiveness over traditional phenotyping methods $^{1}$. Computer vision, a subfield of Artificial Intelligence (AI), focuses on deriving meaningful information from visual data like images and videos $^{2,3}$. One application of CV to agricultural use is fruit detection, where a model learns to find and count ripe fruit like strawberries from video or photos of plants in the field $^{4}$. These methods can accelerate plant breeding by automating the process of counting strawberries to assess plant productivity, or guide robots to pick ripe strawberries $^{4}$.
However, applying CV techniques in agricultural settings presents unique challenges. Two common hurdles are the limited availability of labeled training data and the computational demands of running large models on less powerful devices in the field $^{3}$. For fruit detection, these challenges are particularly acute: sufficient labeled images are needed of strawberries at varying ripeness levels, under diverse lighting conditions, and with natural occlusion from leaves and stems $^{5,6}$.
The state-of-the-art (SOTA) in computer vision currently relies on two primary classes of deep learning algorithms: Convolutional Neural Networks (CNNs) and Visual Transformers. CNNs, like the popular YOLO, ResNet, and MobileNet families, extract relevant features through a series of convolutions like blurring $^{7,8}$. Throughout the 2010s, CNNs were the dominant approach, and they remain widely used when speed and model size are critical. More recently, Visual Transformers have emerged, processing images by dividing them into tokens and embedding them as vectors, allowing them to leverage transformer-based architectures and incorporate broader contextual information, often at the cost of increasing run (inference) time and model size $^{9,10}$. Models of this type include Vision Transformers (ViT) from Google, Florence2 from Microsoft, and Qwen2.5-VL from Alibaba $^{9}$.
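The patch-tokenization step that distinguishes Visual Transformers from CNNs can be sketched in a few lines of NumPy. This is an illustrative toy, not any specific model's code; the 224×224 input and 16×16 patch size are typical ViT defaults, not values from the models discussed here:

```python
import numpy as np

# Toy sketch: turn a 224x224 RGB image into a sequence of flattened
# 16x16 patch "tokens", as ViT-style models do before embedding.
patch = 16
img = np.zeros((224, 224, 3))  # dummy image standing in for a real photo

h, w, c = img.shape
tokens = (
    img.reshape(h // patch, patch, w // patch, patch, c)
       .transpose(0, 2, 1, 3, 4)          # group pixels by patch
       .reshape(-1, patch * patch * c)    # one flat vector per patch
)
print(tokens.shape)  # (196, 768): 14x14 patches, each a 768-dim vector
```

Each of these 196 vectors would then be linearly projected into an embedding and fed through a standard transformer, which is what lets these models attend to context anywhere in the image.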
In this class, we were tasked with training a deep learning model to detect ripe strawberries. Previous research has demonstrated the effectiveness of YOLO models in strawberry detection, achieving high accuracy, recall, and F1 scores $^{5,6}$. However, these studies often predate the release of newer YOLO architectures and were trained on much larger labeled datasets. Therefore, this project will evaluate and compare the performance of recent YOLO models (YOLOv11 and YOLOE) alongside a transformer-based Qwen model on a very small dataset $^{9–11}$. Our primary goals are to assess each model's performance with limited training data, along with its computational cost and speed, providing insights into the most efficient and effective approach for automated strawberry phenotyping.
Image Preprocessing and Augmentation¶
Preprocessing¶
- Shailesh Raj Acharya, Steve Ameridge, and I collaborated on labeling the images.
- The original data set (V1) contained 63 images.
- Later, I added 8 additional images. These were mostly null images (images without any ripe strawberries), added to try to obtain better precision. That is, including images with no ripe strawberries should make the model less likely to label something that is not a ripe strawberry as a ripe strawberry. This version is called V4.
- Data was split with a standard 70-20-10 approach, with 70% of images assigned to training, 20% to validation, and 10% to testing.
- Each training image was duplicated so that three versions of the same image were present.
- Examples of the types of images used in the dataset are shown below.
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir, "Data/Project3---Strawberry-1/train/images/39_crop1_png.rf.9ed6f32f0000a8290b62ea41e167cfff.jpg"))
img2 = Image.open(os.path.join(repo.working_tree_dir, "Data/Project3---Strawberry-4/train/images/Fragaria_berried_treasure_pink_everbearing_strawberry_gc_FRABP_03_jpg.rf.e2b63a188d49dfdb3b564a6ddf3bdc73.jpg")
)
# # Resize (optional)
# img1 = img1.resize((300, 300))
# img2 = img2.resize((300, 300))
# Combine side by side
combined = Image.new("RGB", (img1.width + img2.width, max(img1.height, img2.height)))
combined.paste(img1, (0, 0))
combined.paste(img2, (img1.width, 0))
# Display image
display(combined)
# Add caption
display(Markdown("**Figure 1.** Left: An example of an image from the original dataset. Right: An example of a null image I added. Notice the color of the flowers to train the model to account for features other than color."))
Figure 1. Left: An example of an image from the original dataset. Right: An example of a null image I added. Notice the color of the flowers to train the model to account for features other than color.
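The 70-20-10 split described above can be sketched as follows. This is an illustrative stand-in for the actual Roboflow split (the filenames and seed are made up), using the 71 images of V4:

```python
import random

# Illustrative 70/20/10 train/validation/test split over image filenames.
# Roboflow performs this split internally; this sketch just shows the idea.
random.seed(0)
images = [f"img_{i:02d}.jpg" for i in range(71)]  # 63 original + 8 added
random.shuffle(images)

n_train = int(0.7 * len(images))
n_val = int(0.2 * len(images))
train = images[:n_train]
val = images[n_train:n_train + n_val]
test = images[n_train + n_val:]

print(len(train), len(val), len(test))  # 49 14 8
```

Shuffling before slicing matters: without it, images taken consecutively in the field could cluster similar lighting or plants into one split.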
Augmentation¶
- Originally, I randomly chose a combination of image augmentations I thought would be beneficial, tending toward the high end of what Roboflow recommends. These were applied in addition to the automatic augmentations applied by the `train` function from Ultralytics.
- For all models, I added noise to up to 1.05% of pixels.
- In later versions, I left the images unaugmented in Roboflow (with the exception of noise), relying on the automatic augmentations only and tuning augmentation parameters alongside other hyperparameters.
- Table 1 describes all augmentations applied across the models that were fine-tuned.
| Augmentation | YOLO11 Fine-tuned Default | YOLO11 Fine-tuned Best of 10 | YOLO11 Fine-tuned Best of 274 |
|---|---|---|---|
| degrees rotated | 0.0 + 29 | 0.0 | 0.0 |
| hsv_h (hue) | 0.015 | 0.01458 | 0.01287 |
| hsv_s (saturation) | 0.7 + 0.21 | 0.09784 | 0.09813 |
| hsv_v (value) | 0.4 | 0.1 | 0.1 |
| translate | 0.1 | 0.1 | 0.06559 |
| scale | 0.5 | 0.5 | 0.5 |
| shear | 0.0 | 0.0 | 0.0 |
| perspective | 0.0 | 0.0 | 0.0 |
| flipud | 0.0 | 0.0 | 0.0 |
| fliplr | 0.5 | 0.5 | 0.5 |
| bgr | 0.0 | 0.0 | 0.0 |
| mosaic | 1.0 | 1.0 | 1.0 |
| mixup | 0.0 | 0.0 | 0.0 |
| copy_paste | 0.0 | 0.0 | 0.0 |
| copy_paste_mode | flip | flip | flip |
| auto_augment | randaugment | randaugment | randaugment |
| erasing | 0.4 | 0.4 | 0.4 |
| crop_fraction | 1.0 | 1.0 | 1.0 |
Table 1: All augmentation settings for models that were fine-tuned. Bold indicates differences between models, while the + sign indicates Roboflow augmentations, which were only applied in the first model. For further information on augmentation hyperparameters, please refer to the Roboflow docs $^{12}$.
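The settings in Table 1 map to keyword arguments of the Ultralytics `train` function. As a hedged sketch, the "Best of 274" column could be expressed like this (the `model.train` call is left commented out since it assumes a loaded model and dataset):

```python
# Augmentation hyperparameters from the "Best of 274" column of Table 1,
# expressed as keyword arguments for Ultralytics' train().
aug = {
    "degrees": 0.0,
    "hsv_h": 0.01287,
    "hsv_s": 0.09813,
    "hsv_v": 0.1,
    "translate": 0.06559,
    "scale": 0.5,
    "fliplr": 0.5,
    "mosaic": 1.0,
    "erasing": 0.4,
}

# With a YOLO model and data.yaml in place, the dict would be unpacked into
# training, e.g.:
# model.train(data="data.yaml", epochs=100, imgsz=640, **aug)
print(sorted(aug))
```

Keeping the tuned values in one dict makes it easy to swap whole columns of Table 1 in and out between runs.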
Model Selection¶
I chose three models for my analysis. A description for each model is given below:
- **YOLOv11** is the latest CNN-based model from Ultralytics, optimized for both speed and accuracy in object detection tasks $^{11}$. Compared to the YOLO versions shown in class, YOLO11 employs a series of architectural tweaks that make parts of the model more efficient by requiring fewer parameters, enabling faster inference without significantly affecting performance $^{13}$.
  - I chose the large version to increase the model's base performance. All YOLO models are relatively easy to train on my hardware, so picking the one with the best results was preferable $^{11,12}$.
- **YOLOE** builds on YOLO11 by incorporating two sub-networks that allow YOLOE to detect objects from custom descriptions without additional training $^{10}$. These components are as follows:
  - SAVPE (Semantic-Activated Visual Prompt Encoder) enhances visual salience by linking image regions to semantic features from text prompts $^{14}$.
  - RepRTA (Re-parameterizable Region-Text Alignment) transforms text descriptions (e.g., "red ripe strawberry") into high-dimensional embeddings, aligned with CLIP-like vectors, enabling the model to detect custom or unseen objects without retraining $^{14}$.
- **Qwen2.5-VL 3B Instruct** is a multimodal vision-language model that combines a customized Vision Transformer (ViT) with the Qwen2.5 LLM $^{9}$. Qwen's ViT uses enhanced positional embeddings and multimodal fusion layers to align image features with text queries. It is pre-trained on large image-text datasets and then on vision-language tasks such as region detection $^{14}$.
  - As a multimodal LLM, this model is much larger than the other two $^{10,11,15}$.
I feel that these models give us a good overview of different SOTA CV approaches, representing the trade-off between contextual understanding and processing speed.
Results¶
# Checking GPU
!nvidia-smi
# Check torch version and compatability
print(torch.version.cuda)
print(torch.cuda.is_available())
print("CUDA device count:", torch.cuda.device_count())
Test Yolo11 Pretrained¶
Note: We used an image of a cat with mangosteens to demo each model's abilities without any training (zero shot).
from ultralytics import YOLO
# checks if ultralytics is up to date and if the GPU is available
ultralytics.checks()
Ultralytics 8.3.105 Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB) Setup complete (32 CPUs, 95.7 GB RAM, 994.4/1862.2 GB disk)
# Load Pretrained YOLO11
yolo11 = YOLO(os.path.join(repo.working_tree_dir,"Models/pretrain/yolo11l.pt"))
pic = Image.open(os.path.join(repo.working_tree_dir,"Data/mangosteencat.jpeg"))
result = yolo11.predict(pic,conf=0.25)[0]
detections = sv.Detections.from_ultralytics(result)
0: 640x640 1 cat, 1 cup, 1 bowl, 1 apple, 10.9ms Speed: 9.9ms preprocess, 10.9ms inference, 128.8ms postprocess per image at shape (1, 3, 640, 640)
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator(text_color=sv.Color.BLACK)
annotated_image = pic.copy()
annotated_image = box_annotator.annotate(annotated_image, detections=detections)
annotated_image = label_annotator.annotate(annotated_image, detections=detections)
sv.plot_image(annotated_image, size=(10, 10))
- You can see that YOLO11 is able to detect quite a few objects correctly.
- However, the model's vocabulary is limited, causing it to identify the mangosteens as apples.
# Now Test on Strawberries
pic = Image.open(os.path.join(repo.working_tree_dir,"Data/Project3---Strawberry-1/test/images/27_crop2_png.rf.89569ab2704bf6aa15acab6150fe12b8.jpg"))
result = yolo11.predict(pic,conf=0.25)[0]
detections = sv.Detections.from_ultralytics(result)
0: 640x640 1 apple, 12.5ms Speed: 1.7ms preprocess, 12.5ms inference, 2.3ms postprocess per image at shape (1, 3, 640, 640)
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator(text_color=sv.Color.BLACK)
annotated_image = pic.copy()
annotated_image = box_annotator.annotate(annotated_image, detections=detections)
annotated_image = label_annotator.annotate(annotated_image, detections=detections)
sv.plot_image(annotated_image, size=(10, 10))
- Very inaccurate; the model seems to think that all fruits are apples.
Start Finetune¶
- Now that we've seen YOLO11's zero-shot effectiveness on our data, let's fine-tune it.
# Import Roboflow Project Version 1
os.chdir(repo.working_tree_dir + "\\Data")
rf = Roboflow(api_key ="api")
rf.workspace().project("project3-strawberry").version(1).download("yolov11")
loading Roboflow workspace... loading Roboflow project...
<roboflow.core.dataset.Dataset at 0x21bb7f29250>
# Train the model under default settings
os.chdir(repo.working_tree_dir)
v1yolo11 = yolo11.train(data=os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-1\\data.yaml"),
model = os.path.join(repo.working_tree_dir,"Models/pretrain/yolo11l.pt"),
epochs=100, imgsz=640,name="V1default",val=False)
os.chdir(os.path.join(repo.working_tree_dir, "Validation"))
v1yolo11 = YOLO(os.path.join(repo.working_tree_dir,"Models/finetune/V1default.pt"))
metricsdefault = v1yolo11.val(data = os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-1\\data.yaml"),
plots=True)
Ultralytics 8.3.105 Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB) YOLO11n summary (fused): 100 layers, 2,582,347 parameters, 0 gradients, 6.3 GFLOPs
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Project3---Strawberry-1\valid\labels.cache... 12 images, 0 backgrounds, 0 corrupt: 100%|██████████| 12/12 [00:00<?, ?it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 1/1 [00:02<00:00, 2.23s/it]
all 12 113 0.715 0.735 0.718 0.335
Speed: 0.1ms preprocess, 1.2ms inference, 0.0ms loss, 0.5ms postprocess per image
Results saved to runs\detect\val2
- For default settings with relatively few epochs, the mAP50 (0.718) and mAP50-95 (0.335) are not bad.
- We take these values as a baseline.
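For context, mAP50 counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5, while mAP50-95 averages over thresholds from 0.5 to 0.95. A minimal IoU computation (boxes in `(x1, y1, x2, y2)` format; the example coordinates are made up):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = sum of areas minus the intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A box shifted by a quarter of its width still clears the 0.5 threshold:
print(iou((0, 0, 100, 100), (25, 0, 125, 100)))  # 0.6
```

This is why mAP50 (0.718) is so much higher than mAP50-95 (0.335): moderately misplaced boxes still count at the 0.5 threshold but fail at the stricter ones.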
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir,
"Validation/runs/detect/val/confusion_matrix.png"))
img2 = Image.open(os.path.join(repo.working_tree_dir,
"Validation/runs/detect/val/F1_curve.png")
)
img3 = Image.open(os.path.join(repo.working_tree_dir,
"Validation/runs/detect/val/PR_curve.png")
)
# # Resize (optional)
img1 = img1.resize((2000,1500))
# img2 = img2.resize((300, 300))
# Combine side by side
combined = Image.new("RGB", (img1.width + img2.width + img3.width, max(img1.height, img2.height)))
combined.paste(img1, (0, 0))
combined.paste(img2, (img1.width, 0))
combined.paste(img3, (img1.width + img2.width, 0))
# Display image
display(combined)
# Add caption
display(Markdown("**Figure 2.** The confusion matrix, F1 curve, and PR curve."))
Figure 2. The confusion matrix, F1 curve, and PR curve.
- From these graphs, we see that the number of false positives outnumbers the number of false negatives.
- F1 also attains a stable value quickly before a slower drop-off. This is mirrored in the PR curve, where precision plateaus as recall increases earlier than recall plateaus as precision increases.
- This indicates that the majority of false positives receive lower confidence scores, whereas many true positives receive mid-range confidence scores.
- This could then indicate that many true positives are harder to detect.
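As a reminder of how the curves above relate, precision, recall, and F1 can all be computed from true-positive, false-positive, and false-negative counts. The counts below are illustrative only, not taken from this model's confusion matrix:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp)          # of predicted boxes, fraction correct
    recall = tp / (tp + fn)             # of ground-truth boxes, fraction found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts only:
p, r, f1 = prf1(tp=80, fp=20, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.727 0.762
```

Sweeping the confidence threshold changes which detections count as TP/FP/FN, which is exactly what traces out the F1 and PR curves in Figure 2.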
Let's check this
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir,
"Validation/runs/detect/val/val_batch0_labels.jpg"))
img2 = Image.open(os.path.join(repo.working_tree_dir,
"Validation/runs/detect/val/val_batch0_pred.jpg")
)
# # Resize (optional)
# img1 = img1.resize((2000,1500))
# img2 = img2.resize((300, 300))
# Combine side by side
combined = Image.new("RGB", (img1.width + img2.width, max(img1.height, img2.height)))
combined.paste(img1, (0, 0))
combined.paste(img2, (img1.width, 0))
# Display image
display(combined)
# Add caption
display(Markdown("**Figure 3.** Validation plots. Left: labels. Right: predictions."))